Abstract

Take  a  feed-forward  layered  network  (such  as  a  multilayer 
perceptron  or  a  radial  basis  function  network)  which  is 
to  operate  as  a  pattern  classifier.  The  network  may  have 
several  hidden  layers,  as  many  nodes  as  required  and  any 
desired  nonlinearities  on  the  hidden  units.  The  transfer 
functions  of  the  output  nodes  should  be  linear.  If  the 
network is trained (using any appropriate algorithm) to
minimise the sum squared error over all outputs and patterns
such that the output weights have minimum norm,
then  the  output  values  of  the  trained  network  for  any 
subsequent  input  pattern  will  sum  to  a  constant. 
This note addresses the following interesting observation:

Take a feed-forward layered network (such as a multilayer
perceptron [1, 2] or a radial basis function network [3])
which is to operate as a pattern classifier. The network may
have several hidden layers, as many nodes as required and any
desired nonlinearities on the hidden units. The transfer
functions of the output nodes should be linear. If the network
is trained (using any appropriate algorithm) to minimise the
sum squared error over all outputs and patterns such that the
output weights have minimum norm, then the output values of
the trained network for any subsequent input pattern will sum
to a constant.


A  particular  example  of  this  behaviour  occurs  if  the  target  coding  scheme 
for  this  classifier  were  chosen  as 

\[
T_k^p =
\begin{cases}
1 & \text{if pattern } p \in \text{class } k, \\
0 & \text{otherwise.}
\end{cases}
\qquad (1)
\]

In  this  case  the  output  values  for  any  input  pattern  sum  to  unity.  This  case 
has  a  special  interest  since  it  relates  to  the  idea  from  pattern  recognition  that 
the  output  of  networks  operating  as  classifiers  ought  to  reflect  the  likelihood 
of  a  given  pattern  belonging  to  a  particular  class  [4,  5].  Unfortunately,  to 
achieve  this  sum  rule  it  is  generally  true  that  some  of  the  output  components 
have  to  assume  negative  values,  thus  invalidating  any  interpretation  in  terms 
of  classical  probability  theory.  That  such  a  sum  rule  should  hold  independent 
of  the  inputs  to  the  network  is  surprising  and  to  our  knowledge  has  not  been 
pointed  out  previously  in  the  network  literature.  A  possible  reason  for  this 
is that numerical simulations involving the most common feed-forward
network, the multilayer perceptron, do not often use linear output units (except
in  autoassociative  encoding  applications  [6]).  Moreover,  even  where  linear 
output  units  are  used,  training  is  often  terminated  after  a  finite  number  of 
iterations,  or  when  the  actual  outputs  are  sufficiently  close  to  saturation.  A 
prerequisite  for  the  sum  rule  to  be  observed  is  that  the  network  weights  in  the 
final  layer  have  to  be  optimum,  in  the  sense  that  a  local  minimum  of  the  sum 
squared  error  is  obtained. 

The  rest  of  this  note  is  devoted  to  a  proof  of  the  observed  sum  rule. 

The  network  is  trained  on  P  input  patterns, 

\[ |I^p\rangle \in \mathbb{R}^{n}, \qquad p = 1, 2, \ldots, P \]
where each pattern is represented by a real valued vector in $n$ dimensions
(an  extension  to  complex  valued  patterns  may  be  made).  To  each  training 
pattern  there  corresponds  a  desired  target  pattern, 

\[ |T^p\rangle \in \mathbb{R}^{n'}, \qquad p = 1, 2, \ldots, P \]

which  is  a  vector  in  n'-dimensional  space  where  each  dimension  represents 
one  of  the  n'  possible  classes  that  the  input  pattern  could  belong  to.  The 
intention is to attempt to produce actual outputs from the network, $|O^p\rangle$,
which  are  as  close  as  possible  in  a  least  mean  squares  sense  to  the  desired 
target  patterns  by  choosing  an  appropriate  set  of  values  for  the  adjustable 
parameters  in  the  network  (i.e.  biases  and  link  weights).  The  output  of 
the  network  is  a  weighted  sum  of  the  output  pattern  of  the  final  ‘hidden’ 
layer,  which  in  turn  is  usually  a  weighted  sum  of  the  pattern  of  the  previous 
hidden  layer  passed  through  nonlinear  transfer  functions.  For  instance,  for 
a network with a single hidden layer consisting of $n_0$ nodes, the output of the
$k$-th output node of the network for pattern $p$ may be expressed as

\[
O_k^p = \sum_{j=1}^{n_0} \lambda_{jk} \phi_j^p + \lambda_{0k} \qquad (2)
\]

where $\lambda_{jk}$ is the weight value linking the $j$-th hidden node to the $k$-th output
node, $\phi_j^p$ is the output of the $j$-th hidden node for the $p$-th pattern and $\lambda_{0k}$ is
a constant ‘bias’ value associated with the $k$-th output node.

In general the nonlinear transfer function of the $j$-th hidden node takes
the form $\phi_j^p = \phi_j(g_j[|I^p\rangle])$ where $g_j[|I^p\rangle]$ is a (scalar) function of the input
pattern. The forms of $\phi_j$ and $g_j$ differ depending on the network being used.
Such differences are not, however, important to this discussion. Whatever
the chosen form of the nonlinear transformation, we are only concerned with
the action of the network as described by equation (2). The method of
obtaining  an  appropriate  set  of  weight  values  is  to  minimise  the  error  between 
the  desired  target  patterns  and  the  actual  network  outputs.  The  network 
output  for  pattern  p  may  be  expressed,  conveniently,  in  vector  notation  as 

\[ |O^p\rangle = \Lambda |\phi^p\rangle + |\lambda_0\rangle \qquad (3) \]

$\Lambda$ is an $n' \times n_0$ array of weight values between the $n_0$ hidden units and the $n'$
output units

\[
\Lambda =
\begin{bmatrix}
\lambda_{11} & \cdots & \lambda_{n_0 1} \\
\vdots & \ddots & \vdots \\
\lambda_{1 n'} & \cdots & \lambda_{n_0 n'}
\end{bmatrix}
\qquad (4)
\]

$|\lambda_0\rangle$ is the $n'$-dimensional column vector of bias values associated with the
output units and $|\phi^p\rangle$ is the $n_0$-dimensional column vector of output values of
the hidden units.


For  P  patterns  the  output  matrix  may  be  expressed  as 


\[ O = |\lambda_0\rangle\langle 1| + \Lambda H \qquad (5) \]


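where $\langle 1|$ is a $P$-dimensional row vector of 1's, $H$ is the $n_0 \times P$ array of hidden
unit outputs

\[
H =
\begin{bmatrix}
\phi_1^1 & \cdots & \phi_1^P \\
\vdots & \ddots & \vdots \\
\phi_{n_0}^1 & \cdots & \phi_{n_0}^P
\end{bmatrix}
\qquad (6)
\]

and $O$ is a matrix having the output vectors $|O^p\rangle$ as columns.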
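
As a small numerical illustration of equation (5), the following NumPy sketch builds the output matrix from a bias vector, a weight matrix and a matrix of hidden unit outputs, and checks that it agrees with applying equation (3) to each pattern in turn. The sizes, the random values and the names `Lam`, `lam0` and `H` are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_out, n_hid, P = 3, 4, 5                 # n', n_0 and P (illustrative sizes)

Lam = rng.normal(size=(n_out, n_hid))     # weight matrix, n' x n_0
lam0 = rng.normal(size=n_out)             # bias vector of the output units
H = rng.normal(size=(n_hid, P))           # hidden unit outputs, one column per pattern

# Equation (5): outer product of the bias with a row of ones, plus Lam @ H
ones_P = np.ones(P)
O = np.outer(lam0, ones_P) + Lam @ H

# Equation (3) applied pattern by pattern should give the same columns
O_columns = np.stack([Lam @ H[:, p] + lam0 for p in range(P)], axis=1)
matrix_form_matches = np.allclose(O, O_columns)
```

The outer product simply copies the bias vector into every column, which is why the matrix form and the per-pattern form coincide.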


The adjustable parameters (contained explicitly in $|\lambda_0\rangle$ and $\Lambda$ and implicitly
in $H$) are determined by minimising the sum squared error at the network
output

\[
E = \sum_{p=1}^{P} \bigl\| \, |T^p\rangle - |O^p\rangle \, \bigr\|^2 \qquad (7)
\]

where $\|\ldots\|$ denotes the Euclidean vector norm. Differentiating this expression
with respect to the bias weights $|\lambda_0\rangle$ and equating to zero gives the
optimum choice of values for the bias vector as

\[
|\lambda_0\rangle = |\bar{T}\rangle - \Lambda |m^H\rangle \qquad (8)
\]

where $|\bar{T}\rangle = T|1\rangle / P$ is the mean target vector over all training patterns, $T$ is
a matrix having target vectors $|T^p\rangle$ as columns, and $|m^H\rangle = H|1\rangle / P$ is the mean pattern
vector over all the patterns in the training set produced at the output of the
final hidden layer of nodes.
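
The bias formula of equation (8) can be checked numerically. The sketch below is a minimal illustration under assumed random data: for an arbitrary fixed weight matrix it forms the bias as the mean target minus the weighted mean hidden pattern, and confirms that the residuals then sum to zero over the training set, which is the stationarity condition for the bias. All variable names are our own.

```python
import numpy as np

rng = np.random.default_rng(1)
n_out, n_hid, P = 3, 4, 20
T = rng.normal(size=(n_out, P))           # target vectors as columns
H = rng.normal(size=(n_hid, P))           # hidden unit outputs as columns
Lam = rng.normal(size=(n_out, n_hid))     # any fixed weight matrix

T_bar = T.mean(axis=1)                    # mean target vector
m_H = H.mean(axis=1)                      # mean hidden pattern vector
lam0 = T_bar - Lam @ m_H                  # equation (8)

def sum_squared_error(bias):
    """Sum squared error of equation (7) for a given bias vector."""
    O = bias[:, None] + Lam @ H
    return np.sum((T - O) ** 2)

# The gradient of E with respect to the bias is -2 times this residual sum
residual_sum = (T - (lam0[:, None] + Lam @ H)).sum(axis=1)
E_opt = sum_squared_error(lam0)
E_perturbed = sum_squared_error(lam0 + 0.1)
```

Because the error is quadratic in the bias, any perturbation away from this value increases it, as the comparison of `E_opt` and `E_perturbed` illustrates.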

It  is  clear  from  equation  (8)  that  the  role  of  the  bias  vector  is  to  compensate 
for  the  difference  between  the  mean  of  the  desired  target  vectors  and  the  mean 
of  the  actual  outputs  over  the  training  patterns.  Inserting  this  expression 
for  the  bias  into  the  error  expression  gives 

\[
E = \| \hat{T} - \Lambda \hat{H} \|_F^2 \qquad (9)
\]

where $\|\ldots\|_F$ is the Frobenius matrix norm, $\hat{T} = T - |\bar{T}\rangle\langle 1|$ is the mean-shifted
matrix of targets and $\hat{H} = H - |m^H\rangle\langle 1|$ is the mean-shifted matrix of
outputs at the final hidden layer.
One solution for $\Lambda$ which minimises the above error, and which has
minimum (Frobenius) norm among all such solutions, is given by

\[
\Lambda = \hat{T} \hat{H}^{+} \qquad (10)
\]

where $\hat{H}^{+}$ is the $P \times n_0$ Moore-Penrose pseudo-inverse of $\hat{H}$ [9].
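
To illustrate why the minimum norm condition matters, the following sketch uses a hypothetical hidden layer in which one unit is duplicated, so that the least squares problem has many solutions. It compares the pseudo-inverse solution of equation (10) with a second minimiser obtained by adding a component that the mean-shifted hidden matrix annihilates. The data are random and the explicit `rcond` cutoff is our own choice for numerical safety.

```python
import numpy as np

rng = np.random.default_rng(2)
n_out, P = 2, 10

# Assumed hidden outputs with a duplicated unit: the mean-shifted matrix is
# rank deficient, so the minimiser of the error is not unique.
H_base = rng.normal(size=(3, P))
H = np.vstack([H_base, H_base[0]])        # n_0 = 4, last row repeats the first
T = rng.normal(size=(n_out, P))

T_hat = T - T.mean(axis=1, keepdims=True) # mean-shifted targets
H_hat = H - H.mean(axis=1, keepdims=True) # mean-shifted hidden outputs

H_pinv = np.linalg.pinv(H_hat, rcond=1e-10)   # P x n_0 pseudo-inverse
Lam = T_hat @ H_pinv                          # equation (10)

def error(weights):
    """Squared Frobenius error of equation (9)."""
    return np.linalg.norm(T_hat - weights @ H_hat, "fro") ** 2

# Projector onto directions orthogonal to the columns of H_hat; adding
# anything of the form Z @ N to Lam leaves Lam @ H_hat unchanged.
N = np.eye(H.shape[0]) - H_hat @ H_pinv
Lam_other = Lam + rng.normal(size=Lam.shape) @ N
```

Both weight matrices give the same error, but the pseudo-inverse solution has the smaller Frobenius norm, and it is this solution that the sum rule refers to.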

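Thus, using these optimum values of the weight matrix and bias vector,
the expression for the general output $|O\rangle$ of the trained network for an input
pattern giving rise to the pattern vector $|h\rangle$ at the output of the final hidden
layer is

\[
|O\rangle = |\bar{T}\rangle + \hat{T} \hat{H}^{+} \left( |h\rangle - |m^H\rangle \right) \qquad (11)
\]

We can sum the outputs by multiplying on the left by $\langle 1|$ (here an
$n'$-dimensional row vector of 1's), thus

\[
\text{Sum over outputs} = \langle 1|O\rangle = \langle 1|\bar{T}\rangle + \langle 1|\hat{T} \hat{H}^{+} \left( |h\rangle - |m^H\rangle \right) \qquad (12)
\]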

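However, consider the bra vector $\langle 1|\hat{T}$ in the situation when the sum of
each column of $T$ is a constant ($= t$ say):

\[
\begin{aligned}
\langle 1|\hat{T} &= \langle 1|T - \langle 1|\bar{T}\rangle\langle 1| \\
&= t\langle 1| - \frac{t\,\langle 1|1\rangle}{P}\,\langle 1| \\
&= t\langle 1| - t\langle 1| \\
&= \langle 0|
\end{aligned}
\qquad (13)
\]

since $\langle 1|1\rangle = P$.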

Therefore,  the  sum  over  the  outputs  of  the  trained,  optimised  network  is 
given  by 

\[
\text{Sum over outputs} = \langle 1|O\rangle = \langle 1|\bar{T}\rangle = \text{Sum of mean target vector.} \qquad (14)
\]


This completes the proof of the original observation. In particular, for a
one-from-$n'$ coding scheme where the components of the target vector belong
to $\{1, 0\}$, the outputs of the network for any input sum to unity.
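
The sum rule can be observed numerically. The sketch below is a minimal illustration, not the setup of any particular experiment: it assumes a hypothetical fixed hidden layer with random weights and a tanh nonlinearity, one-from-n' targets with random class labels, and output weights obtained from the pseudo-inverse. It then evaluates the network on inputs it has never seen and sums the outputs.

```python
import numpy as np

rng = np.random.default_rng(3)
n_in, n_hid, n_classes, P = 2, 5, 3, 30

# Hypothetical fixed hidden layer: random weights, tanh nonlinearity
W = rng.normal(size=(n_hid, n_in))
b = rng.normal(size=n_hid)

def hidden(X):
    """Hidden unit outputs for inputs X of shape (n_in, N)."""
    return np.tanh(W @ X + b[:, None])

X_train = rng.normal(size=(n_in, P))
labels = rng.integers(0, n_classes, size=P)
T = np.eye(n_classes)[:, labels]          # one-from-n' targets as columns

H = hidden(X_train)
T_bar = T.mean(axis=1, keepdims=True)     # mean target vector
m_H = H.mean(axis=1, keepdims=True)       # mean hidden pattern vector
Lam = (T - T_bar) @ np.linalg.pinv(H - m_H)   # minimum norm output weights

def network_output(X):
    """General output of the trained network, as in equation (11)."""
    return T_bar + Lam @ (hidden(X) - m_H)

X_new = 5.0 * rng.normal(size=(n_in, 1000))   # inputs far from the training set
outputs = network_output(X_new)
output_sums = outputs.sum(axis=0)
smallest_output = outputs.min()           # individual outputs may be negative
```

Every entry of `output_sums` equals one to within rounding error, even though the individual outputs are not constrained to lie between zero and one.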
However, it is now apparent that a more general result holds:

Theorem.

Consider a network having linear output units. Let the weights associated
with the connections to these units be determined by linear
minimum norm least squares optimisation. Then if the target vectors satisfy
an arbitrary linear constraint of the form

\[ \langle u|T^p\rangle = \langle u|\bar{T}\rangle \qquad \forall\, p = 1, 2, \ldots, P \]

with $\langle u|$ a constant vector, then the general output of the network $|O\rangle$
satisfies:

\[ \langle u|O\rangle = \langle u|\bar{T}\rangle \]


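Proof

The general output of the network is given by equation (11):

\[ |O\rangle = |\bar{T}\rangle + \hat{T} \hat{H}^{+} \left( |h\rangle - |m^H\rangle \right) \]

Therefore

\[ \langle u|O\rangle = \langle u|\bar{T}\rangle + \langle u|\hat{T} \hat{H}^{+} \left( |h\rangle - |m^H\rangle \right) \]

But

\[ \langle u|\hat{T} = \langle u|T - \langle u|\bar{T}\rangle\langle 1| \]

By hypothesis, $\langle u|T = \langle u|\bar{T}\rangle\langle 1|$, so that $\langle u|\hat{T} = \langle 0|$. Therefore

\[ \langle u|O\rangle = \langle u|\bar{T}\rangle \qquad \blacksquare \]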

Remark: If the set of target vectors satisfies several linear constraints
simultaneously, then so will the general network outputs.
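
The general result can be illustrated in the same way as the special case of unit sums. In the sketch below, the constraint vector `u`, the constant `c` and the random data are all assumptions chosen for illustration. Targets are first projected so that each one satisfies the linear constraint, the output weights are then obtained from the pseudo-inverse, and the constraint is evaluated on the outputs for arbitrary hidden patterns.

```python
import numpy as np

rng = np.random.default_rng(4)
n_out, n_hid, P = 4, 6, 50
u = np.array([1.0, -2.0, 0.5, 3.0])       # arbitrary constraint vector
c = 2.5                                   # value each target must give under u

# Project random targets so that u @ T[:, p] equals c for every pattern p
T0 = rng.normal(size=(n_out, P))
T = T0 - np.outer(u, (u @ T0 - c) / (u @ u))

H = np.tanh(rng.normal(size=(n_hid, P)))  # stand-in hidden unit outputs
T_bar = T.mean(axis=1, keepdims=True)     # mean target vector
m_H = H.mean(axis=1, keepdims=True)       # mean hidden pattern vector
Lam = (T - T_bar) @ np.linalg.pinv(H - m_H)   # minimum norm output weights

h_new = rng.normal(size=(n_hid, 200))     # arbitrary hidden patterns
O_new = T_bar + Lam @ (h_new - m_H)       # general outputs, as in equation (11)
constraint_values = u @ O_new
```

Each entry of `constraint_values` reproduces `c` to within rounding error, which is the value of the constraint on the mean target vector.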